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ABSTRACT 

Motivation: Recently there has been increasing interest in the effects 
of cell mixture on the measurement of DNA methylation, specifically 
the extent to which small perturbations in cell mixture proportions can 
register as changes in DNA methylation. A recently published set of 
statistical methods exploits this association to infer changes in cell 
mixture proportions, and these methods are presently being applied 
to adjust for cell mixture effect in the context of epigenome-wide as- 
sociation studies. However, these adjustments require the existence 
of reference datasets, which may be laborious or expensive to collect. 
For some tissues such as placenta, saliva, adipose or tumor tissue, the 
relevant underlying cell types may not be known. 
Results: We propose a method for conducting epigenome-wide as- 
sociation studies analysis when a reference dataset is unavailable, 
including a bootstrap method for estimating standard errors. We dem- 
onstrate via simulation study and several real data analyses that our 
proposed method can perform as well as or better than methods that 
make explicit use of reference datasets. In particular, it may adjust for 
detailed cell type differences that may be unavailable even in existing 
reference datasets. 

Availability and implementation: Software is available in the R 
package RefFreeEWAS. Data for three of four examples were 
obtained from Gene Expression Omnibus (GEO), accession numbers 
GSE37008, GSE42861 and GSE30601, while reference data were 
obtained from GEO accession number GSE39981. 
Contact: andres.houseman@oregonstate.edu 
Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

Recently there has been increasing interest in the effects of cell 
mixture on the measurement of DNA methylation, specifically 
the extent to which small perturbations in cell mixture propor- 
tions can register as changes in DNA methylation (Adalsteinsson 
et aL, 2012; Bock, 2012; Houseman et al., 2012; Lam et ai, 2012; 
Liu et aL, 20\3; Reinius et al., 2012). DNA methylation, tightly 
associated with alterations in the nucleosome DNA scaffold (and 
hence chromatin), is in part responsible for coordination of gene 
expression in individual cells (Ji et aL, 201 1; Khavari et aL, 201 1; 
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Natoh, 201 1). It is now appreciated that differentially methylated 
DNA regions (DMRs) distinguish cell lineages with high sensi- 
tivity and specificity (Baron et aL, 2006), and considerable re- 
search is now underway to delineate precise DMRs that define 
and specify a particular cell lineage. A recently pubhshed set of 
statistical methods exploits this association to infer changes in 
cell mixture proportions solely on the basis of a DNA methyla- 
tion profile (Houseman et aL, 2012). These methods may broadly 
be conceived as the projection of DNA methylation data from a 
target dataset Si onto a reference dataset Sq consisting of DNA 
methylation profiles for isolated cell types. For example, S\ may 
consist of DNA methylation measured in whole blood from a 
case-control study (Langevin et aL, Liu et aL, 2013; 

Koestler et aL, 2012; Wang et aL, 2013), while Sq may consist 
of corresponding DNA methylation profiles from isolated leuko- 
cyte cell types (e.g. CD4+ leukocytes, B cell lymphocytes, gran- 
ulocytes) as provided by Houseman et aL (2012) and Reinius 
et aL (2012). Recently, some investigators have used these meth- 
ods to adjust for cell mixture effect in the context of epigenome- 
wide association studies (EWAS, Liu et aL, 2013). However, 
these adjustments require the existence of reference datasets Sq, 
and these sets may be laborious or expensive to collect. In add- 
ition, for some tissues such as placenta, saliva, adipose or tumor 
tissue, the relevant underlying cell types may not be known. In 
this article, we propose a method for conducting EWAS analysis 
when a reference dataset is unavailable, demonstrating that it can 
perform as well as the methods that make explicit use of a ref- 
erence dataset. 



2 STATISTICAL METHODS 

Our proposed method, closely related to surrogate variable ana- 
lysis (SVA, Leek and Storey, 2007), relies on a simple projection 
based on singular value decomposition (SVD), as does SVA. In 
SVA, the residuals of a linear model are decomposed into a 
factor-analytic structure and the factors are used subsequently 
in a regression model, with iteration resulting in a final set of 
surrogate variables. Our approach, which does not require iter- 
ation to obtain estimates, includes unadjusted linear coefficient 
estimates as columns of the matrix to be decomposed. As 
we demonstrate later in the text, this construction associ- 
ates the residuals of the unadjusted model with the unadjusted 
coefficient estimates in a manner consistent with a linear mixing 
assumption. 
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We assume DNA methylation array results Y, an m x n 
matrix representing DNA methylation measurements for m 
CpG sites and n subjects. We assume that the measurements 
are on a 'beta value' scale, having an interpretation as estimates 
of the proportion of methylated molecules corresponding to a 
given locus. In addition, we assume Sinn x p matrix X of covari- 
ates, including an intercept (in the first column), the phenotype 
of interest and potential confounders. The standard unadjusted 
EWAS analysis (on beta values) posits the linear model 

Y = B*X^ + E* (1) 

where B* is an m x /? matrix of coefficients and E* is an m x w 
matrix of errors. Generally, analysis proceeds as if the rows of E* 
were independent, although permutation-based inference is 
sometimes used to account for potential correlation among 
rows, and B* can sometimes include 'surrogate variable' con- 
founders (Leek and Storey, 2007; Teschendorff et al., 2011) 
that account for technical error. However, DNA methylation 
effects may be mediated through covariate effects on cell mix- 
tures, i.e. n = Xr + S and Y = BX^ + MO^ + E, where O is an 
n X k matrix of subject-specific cell proportions for k cell types 
(with rows summing to values <1), F is a p x k coefficient 
matrix representing cell-proportion effects, E is Sinn x k matrix 
of errors, B is an m x matrix of direct epigenetic effects (not 
mediated by effects on cell type), M is an m x matrix of cell- 
specific mean methylation values (falling between 0 and 1) and E 
is an m X « matrix of errors. The goal of an adjusted EWAS 
analysis is to estimate the direct effects B. Note that M can be 
obtained from reference data, but it is unknown if a reference 
dataset does not exist. We explicitly posit a model on the beta- 
value scale because the effects are expected to be linear and addi- 
tive only on this scale, not on the M-value scale often used (Liu 
et al., 2013). Substitution results in the following linear model: 

Y = (B + Mr^)X^ + MS^ + E (2) 

where it becomes evident that B* = B + MF^ and 
E* = MS^ + E. Although a naive analysis would treat the 
error matrix E* as having independent rows, an alternative 
model is to assume a factor-analytic structure on E* or E, 
specifically 

E = LU^ + O (3) 

where L is an m x g matrix of CpG-specific factor loadings, U is 
Sin n X q matrix of latent effects and O is an m x « matrix of 
'uniqueness' errors. This formulation is implicit in methods such 
as SVA (Leek and Storey, 2007) and independent surrogate vari- 
able analysis (ISVA, Teschendorff et al., 2011), which are tech- 
niques proposed for addressing batch effects and confounders as 
well as cell mixture effects, although neither SVA nor ISVA ex- 
plicitly posits this structure. Substituting (2) in (3) results in the 
following model: 

Y = (B + MF^)X^ + (M, L)(S, U)^ + O (4) 

which expresses the explicit dependence of the latent structure of 
E* on the unknown cell-specific methylation matrix M. M ap- 
pears twice in (4), once in the error structure, and once as a 
random effect on X. The SVA method extracts latent subject- 
specific effects using an SVD, which might also be used in this 



formulation as well, as long as the double appearance of M can 
be addressed. Careful inspection of (4) reveals that 

(B*, E*) = (M, L)[(F, O^x,)^, (S, U)T] + (B, O^x,) + O 
= AU^ + (B,0,,x,) + O 

for some m x {k-\- q) matrix A = (M, L) and {k -\- q) x {p -\- n) 
matrix U = [(F, O^x^)^, (S, U)^], with U having at least q or- 
thogonal columns. Although the k rows of the biologically deter- 
mined matrix [F^, S^] need not be orthogonal, the products MS 
and MF^ in (2) are not fully identified, in the sense that 
MS = MAA"^S and MF^ = MAA"^F^ for any invertible 
k xk matrix A, including any that orthogonalizes the rows 
of^[F^,S^]; thus, it is possible to find a fully orthogonal 
U that still results in quantities that satisfy (2) and yield an 
identifiable B. 

Motivated by this observation, we propose applying an SVD 
on (B ,E ) after fitting the unadjusted model (1); that is, we 
compute the SVD of the matrix obtained from the unadjusted 
model by concatenating the estimated coefficient matrix with the 
estimated residual matrix. Specifically, with dimension d fixed 
(d > k -\- q), WQ find m x d loading matrix A, (p -\- n) x d latent 
variable matrix U and m x {p -\-n) uniqueness error matrix Q> 
such that 

(B , E ) = AU + O 

and U U = I^. This is easily achieved by selecting the first d 
terms of the SVD of (B ,E ), i.e. the SVD terms corresponding 
to the d largest singular values. Specifically, the SVD pro- 
duces (B , E ) = LAU + L'A'U'^, where A is a diagonal dx d 
matrix. A' is a diagonal {p + n — d)x(p-\-n — d) matrix, 
L L = U U = I^, L'^L' = U'^U' = Ip^n-d and every diagonal 
element of A' is less than every diagonal element of A; A is 
thus obtained as LA. As M is a submatrix of A, we propose 
the following estimator for B, obtained as the residual of the 
projection of B* onto the column space of A: 

B = B* - A(A^A)-^A^B* (5) 

For any no n- singular dx d matrix D, 

(AD)[(AD)^(AD)]-VaD)^ = A(A^A)-^A^ 

so that (5) is independent of the scales chosen for each column of 
A. Although it is impossible to distinguish M from L within A, L 
has an interpretation that is identical with the surrogate variables 
extracted by SVA, and that SVA would also be unable to distin- 
guish M from L. 

Although this estimator is motivated by an explicit statistical 
model, its construction is somewhat ad hoc. Inference, there- 
fore, demands an appropriate bootstrap estimate of coefficient 
standard errors. To adequately account for correlation in the 
error structure, we propose the following approach. First, 
B* = B + MF^ represents all systematic variation (cell-mediated 
and non-cell-mediated) and E* = MS^ + E represents all unex- 
plained variation (in both cell composition and non-cell- 
mediated variation), so that sampling with replacement from 
the columns of E* should form the basis of the bootstrap. 
However, the variance of the elements of E* will depend on 
the corresponding elements of M* = B*X^, as the biologically 
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determined values are approximately beta-distributed. 
[M* = E(Y|X) is a matrix of mean values /x*-, not the cell-specific 
methylation matrix M.] As the variance of a beta-distributed 
variable with mean /x scales by /x(l — /x), we obtain a matrix 
Q of mean-standardized errors by dividing each element ^* of 
E by [/2*(1 — /2*)]^/^, where /2* is the corresponding element of 
M . Thus, we construct bootstrap sample Y^^^ as 
= M* + E*^'\ where the elements of E*^'^ are obtained by 
first sampling with replacement from the columns of Q , then 
multiplying the result, element-wise, by the elements 
[/2*(1 — /2*)]^^^ (of the un-resampled matrix). This standardiza- 
tion approach preserves the relationship between the mean and 
variance of a beta distribution, while sampling column-wise pre- 
serves correlations across CpGs. As shown in the Supplementary 
Material, the latter property allows the bootstrap samples to 
form the basis of an omnibus test of significance over the 
entire array. Bootstrap estimates B*^^^ and B^^^ are thus obtained 
by fitting (1) on Y^^\ and subsequently recomputing (5). 
Standard errors for the elements of B are obtained by 
calculating the corresponding standard deviations over boot- 
strap samples B^'\ re 1,...,J? (with, e.g. R = 25^ or 7^ = 500). 
Bootstrap-based standard errors for B can be obtained simply 
by computing standard deviations from the bootstrap samples 
obtained by fitting (1) to the bootstrap samples Y^''\ 

The proposed methods can easily be adapted to accommo- 
date missing values in Y as follows: (i) rows of B* corres- 
ponding to rows of Y having missing values can be estimated 
on a row-by-row basis via complete-case analysis; (ii) the 
SVD producing L and U can be obtained from completely 
observed rows of (B ,E ); A is then constructed from L for 
completely observed rows, or else for rows with missing 
values by projecting (B ,E ) onto U using a complete-case 
regression analysis. The bootstrap procedure introduces miss- 
ing values by sampling from replacement from the columns 
of Q , which will contain missing values for elements of Y 
that are missing. 

A remaining issue is appropriate selection of the dimension d. 
Teschendorff et al. (2011) propose a method for estimating the 
dimension of a latent surrogate variable using random matrix 
theory (RMT); application of their algorithm to the matrix E 
is one simple approach for estimating d. The simulation studies 
presented in the Supplementary Material suggest that this 
method performs well. As shown in the Supplementary 
Material, it outperforms a simple approach based on minimizing 
Akaike Information Criterion (AIC) and Bayesian Information 
Criterion (BIC), and data analysis results suggest that it produces 
estimates reasonably similar to those produced by the method 
suggested by Buja and Eyuboglu (1992) and implemented in the 
R package sva. In the case where E possesses missing values, 
only the completely observed rows are used. 

Finally, we once again emphasize that the proposed method- 
ology necessarily requires that analysis proceed on a linear 
(jS-value) scale, because the logit-transform used to construct 
M-values destroys the linear mixing assumption. However, we 
acknowledge that a substantial portion of error may occur on the 
logit scale, owing to the fact that the individual probes used to 
interrogate the molecular methylation states are expected to have 
approximately lognormal distributions. We address this concern 
in the next section. 



3 SIMULATION STUDY 

We conducted a simulation study to confirm that the proposed 
methodology produces unbiased results. Simulation parameters 
were chosen to produce datasets as similar as possible to realistic 
DNA methylation datasets, although the dimensions were small 
enough to make the simulation study feasible. Fixing m= 1000, 
n = 250 and k = 4, we constructed m x k matrix M by selecting 
each element of the first 250 rows of M as a Beta(0.25, 0.25) 
random variable, setting the remaining rows equal to zero; 
thus, the first 250 rows of M correspond to CpGs that are 
DMRs for k = 4 cell types. M was held constant over all simu- 
lations. The m X 2 matrix B was constructed as B = (pi , ^P2)^ 
where ^ e {0, 1} and B = (Pi,P2) were generated (once for all 
simulated datasets) from a 3-part mixture model as described 
in the Supplementary Material, in such a way that non-zero 
direct effects tended to occur only for CpGs with mid-range 
values, and their signs tended to correlate inversely with the 
intercept. Details appear in the Supplementary Material 
(Section I), but for each simulated dataset there were 30 negative 
effects, 23 positive effects and 947 null effects. For each simu- 
lated dataset, the phenotype of interest, Xi (i e {1,...,^}), was 
generated as a ^(—0.5,0.5) uniform random variable. Each 
row of the cell proportion matrix O was generated as 
Dirichlet((Oip), where the dispersion parameter p = 100 resulted 
in biologically plausible variation in cell proportions for a speci- 
men such as blood, o;/ = (0.2, 0.1, 0.1, 0.6)^ + r(0.05, 
-0.05,0,0)^x/, and r g {0, 1} was a simulation parameter con- 
trolling the strength of the phenotype effect on cell mixture. 
Generation of the error component of the model was imple- 
mented in a manner that allowed us to investigate the 
effects of error on different scales, logit and linear. First, we 
presume that biological variation occurs on the hnear scale, 
with error arising from a beta distribution. Thus, with 
(/xii, iXjni)^ = fij = Pi-\- P2Xi-\-M(Oi, we generated beta- 
distributed values bij ^ BetaipLggj, (1 — jlij)qj), jlji = min{max 
{/xji, 0}, 1}, qj = mm{n-' ELi My/(1 - /Ijd/O' -1,1} and 

0 = 0.008; in other words, we chose row-specific dispersion par- 
ameters such that the average variance of bji for row j over 

1 e {\,...,n} was about 0^. Simulations demonstrate relative in- 
sensitivity to 0, as shown in Supplementary Material (Section IV; 
Fig. S5(e)]. 

To obtain the measured methylation values, we added a micro- 
array error of the factor-analytic form described by (3) but incor- 
porated on a logit scale. Beta values yji are typically constructed 
as the ratio yjiMliyjiM + yjiu + where yjiM is the measurement 
obtained from a probe designed to interrogate a methylated mol- 
ecule, yjiu is the measurement obtained from the corresponding 
unmethylated probe and £ ~ 0 is a small constant chosen in ad- 
vance. A common assumption in microarray analysis is that 
measurements such as yjiM and yjnj are lognormally distributed; 
consequently, logit(yy/) ^ log(yy/M) — log(yy,x/) is normally dis- 
tributed, and we expect the technical error introduced by the 
microarray to occur on the logit scale. For a latent error com- 
ponent on the logit (M-value) scale of measurement, we set g = 2 
and generated the elements of the n x q matrix U = (ui, u„) in 
(3) as standard normal variables, while the corresponding L 
matrix was generated as ZA, with the elements of the m x ^ 
matrix Z generated as standard normal variables and 
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diag(A) = (0.25, 0.01)^. For the matrix ^ in (3), the elements of 
each row j were simulated as N(0,aj), with 
aj = 0.25^ — diag(A^) (i.e. the standard deviation of each value 
was 0.25, but the errors were correlated across CpGs). Thus, with 
microarray errors O = (cpjdji, the simulated value yji for row j 
and column / was generated as yji = [1 + exp{log(/)y/)— 
\og(\ — bji) -\- (pji}]~^ . Table 1 describes the combinations of 
simulation parameters ^ G {0, 1 } and re {0,1} used for each of 
four scenarios used to investigate basic properties of proposed 
estimator. Supplementary Figure S2(b) provides a clustering 
heatmap displaying a typical dataset simulated under 
scenario #1. In each simulation (except those conducted to com- 
pare methods of dimension estimation), the RMT method of 
Teschendorff et al. (2011) was used to estimate the latent 
dimension. 

The Supplementary Material describes additional simulations 
used to investigate several specific issues, as well as providing 
additional graphical results for the four main simulation scen- 
arios reported here. Supplementary Material (Section III) reports 
the effects of larger sample sizes (n = 500). Section IV describes a 
simulation experiment designed to investigate the effect of differ- 
ent scales of error variability on the accuracy of different meth- 
ods of dimension estimation (both in estimating the dimension 
itself and in the impact on Root-Mean-Squared-Error (RMSE)). 
Section IV also provides a comparison of our proposed method- 
ology with SVA. Section V describes a simulation experiment 
designed to investigate power and Type I error. In every scenario 
considered, we simulated 100 separate datasets, and for each 
simulated dataset, we used 250 bootstrap samples for inference. 

For simulation #1, Figure 1 compares slope estimates P2 
versus the SVA- adjusted variant of P2 versus P2 P2 versus 
P2, on each of the m= 1000 features. This figure demonstrates 
that although we expect the naive unadjusted estimator to pro- 
vide an unbiased estimator of the total effect P2^ its estimates of 
direct effects are somewhat biased in comparison to our pro- 
posed estimator especially for the null slopes. It also demon- 
strates that SVA produces biases similar to the unadjusted 
analysis. This figure is consistent with Table 2, which reports 
the total RMSE [e.g. the square root of the simulation average 
of YljLi (Pj ~ Pj)^] for s^ch of the four comparisons across 
all four scenarios. Table 2 suggests similar behavior for null 
direct effects when the cell mixture effect is non-null (simulation 
# 3). When the mixture effect is null, P2 P2 estimate P2 with 
about the same precision, as one would anticipate. Under 



simulation scenario #1, for each of 1000 features. Figure 2 
plots simulation SD versus median bootstrap estimate (across 
100 simulations) of the direct effect estimator. The bootstrap 
procedure appears tolerably unbiased, although our bootstrap 
standard error estimator yields apparently inflated estimates 
for some DMRs and a handful of non-DMR CpGs having 
non-null effect. The two very biased estimates result from inter- 
cepts lying near the zero boundary for mean methylation /x, re- 
sulting in non-linear effects (because of truncation) for some 
subjects having strongly negative values of x; in the 
Supplementary Material we provide evidence that the bias de- 
creases in larger samples (Section III). Table 2 reports the median 
(over CpGs) of the ratio of median bootstrap standard error 
(over simulations) to simulation SD for all four scenarios, for 
both P2 and P2, standard errors for the latter being computed 
using the standard linear model theory approach. Although the 
proposed bootstrap standard error methodology is imperfect, it 




'Bit '6 <si oat 
TrueCoeffidBfll 



True Coflffioient 



■o.nt 

True Ooeffiaant 



Fig. 1. Simulation 1: estimated effect by true effect. Comparison of slope 
estimates: true direct effect {P2) versus its estimate (P2), true direct effect 
versus the SVA-adjusted estimate and true direct effect (P2) versus the 
unadjusted effect (pj)- Squares indicate DMRs. Red indicates non-null 
CpGs. Black squares represent non-null DMRs 



Table 2. Simulation results summary 



Table 1. Simulation scenarios 



Sim 




T 


Description 


number 








1 


1 


1 


53 non-null direct eff , non-null cell mixture eff 


2 


1 


0 


53 non-null direct eff, null cell mixture eff 


3 


0 


1 


0 non-null direct eff, non-null cell mixture eff. 


4 


0 


0 


0 non-null direct eff, null cell mixture eff 



Note: Description of simulation parameters: ^ controls the strength of the direct 
(non-cell-mediated) effect on methylation; t controls the strength of the effect of 
covariate on cell-mixtures. 



Sim RMSE SE inflation 





BB 


BB* 


BBsv A 




B 


5* 


1 


0.0077 


0.0171 


0.0148 


0.0091 


1.04 


0.89 


2 


0.0063 


0.0094 


0.0054 


0.0094 


1.05 


0.87 


3 


0.0070 


0.0171 


0.0150 


0.0090 


0.99 


0.89 


4 


0.0054 


0.0093 


0.0053 


0.0093 


0.99 


0.87 



Note: BB, the RMSE for ^2 versus its estimate (^2); BE*, the RMSE for ^2 versus 
the unadjusted P2, BBsv a,- the RMSE for ^2 versus the SVA-adjusted variant of 
P2, B*B*, the RMSE for P2 versus its estimate (P2, SE Inflation is the median (over 
all m CpGs) of medsim{SEj)/SSDj, where the median SE is taken over simulations 
for each CpG j and similarly for the simulation standard deviation SSD. 
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Fig. 2. Simulation 1: bootstrap standard error by simulation standard 
deviation for B2. To increase legibility of the plot, SE estimates for two 
CpGs producing extreme bias have been moved to the left, as indicated 



appears to be as good as or better than the standard asymptotic 
methods used to compute standard errors for unadjusted effect 
estimates B . Supplementary Material (Section III) provides 
plots similar to those provided in Figures 1 and 2 for simulation 
scenarios #2, #3 and #4; they are consistent with the results and 
interpretations given here. 

The results of the simulation experiment described in Section 
IV suggest that the RMT method of dimension estimation has 
good accuracy in estimating the correct latent variable dimension 
when the magnitude of the latent effects is sufficiently large rela- 
tive to other error components, superior to simpler methods 
based on AIC and BIC. RMT was able, in general, to estimate 
the dimension correctly even though the errors were incorpo- 
rated on different scales (linear beta and logit M- value). In add- 
ition, the results suggest that when the microarray variation is 
relatively large, the reference-free and SVA methods have about 
the same level of error, presumably because the microarray error 
swamps the biological error. However, when the microarray 
error is smaller [but still consistent with realistic datasets, as 
shown in Supplementary Fig. S2(a)], our proposed reference- 
free method outperforms SVA, particular in estimating coeffi- 
cients for DMRs; this latter phenomenon presumably occurs 
because the cell-mixture property is explicitly used in supervising 
the deconvolution. Section V describes a simulation experiment 
to investigate power and Type I error, using a bootstrap-based 
method for testing omnibus significance. It suggests appropriate 
Type I error control when P2 = 0, and reasonable power when 
the effects of P2 are large enough. 



4 DATA EXAMPLES 

We demonstrate our proposed methodology on four datasets. 
The first consists of Illumina Infinium HumanMethylation450 
BeadChip array data on which bisulfite-converted DNA from 
whole blood was hybridized; the data were obtained from 
Gene Expression Omnibus (GEO), Accession number 
GSE42861, and consisted of n = 6^9 subjects: 354 rheumatoid 
arthritis patients (cases) and 335 normal controls, originally 



published by Liu et al. (2013). Using the data available on 
GEO, we obtained four different estimates for the difference in 
DNA methylation measured on the beta scale between case and 
control at 384410 autosomal CpGs whose Infinium probes con- 
tained no single nucleotide polymorphism (SNP) and had no 
SNP at a flanking G site ('non-SNP CpG sites'). In the 
unadjusted analysis, we simply applied the limma procedure 
(Smyth, 2004), with design matrix consisting of an intercept 
and an indicator variable for case status. We then applied 
the method of Houseman et al. (2012) to data from 387 
CpG sites overlapping between no n- SNP CpG sites on the 
HumanMethylation450 and the 500 leukocyte differentially 
methylated regions made available publicly to infer leukocyte 
proportions (the full Illumina Infinium HumanMethylation27 
dataset is available on GEO, Accession number GSE39981); in 
the reference-based analysis, we used limma to estimate case- 
control differences adjusted for leukocyte type by using five of 
the six available types: B-cell, CD4+ T, CD8+T, granulocyte 
and NK (monocyte proportions were dropped to avoid an ill- 
conditioned design matrix). In the SVA-adjusted approach, we 
used the R package SVA (version 3.6.0) both to compute the 
dimension of the surrogate variables and to determine the sur- 
rogate variables themselves. Using the method of Buja and 
Eyuboglu (1992) implemented in sva, we found d=53 surrogate 
variables; after estimating the 53 surrogate variables, we adjusted 
for them using limma in a manner similar to the previous ana- 
lysis. Finally, we applied our proposed reference- free analysis, 
with X equal to the same design matrix used in the unadjusted 
analysis. In the latter analysis, the latent variable dimension was 
estimated to be 37 by the RMT method of Teschendorff et al. 
(2011); because simulations suggest accurate dimension estima- 
tion by the RMT method, we used d=31. 500 bootstrap samples 
were used for inference. The DNA methylation dataset available 
on GEO contains no missing values. Figure 3 shows volcano 
plots of the arthritis case coefficient for the three different ana- 
lyses, demonstrating diminished significant for both adjusted 
analyses shown in the figure (reference -based and reference- 
free). Interestingly, significance for the reference-free analysis is 
diminished relative to the reference-based analysis, suggesting 
that the six leukocyte types profiled by Houseman et al. (2012) 
and available in GEO Accession GSE39981 may be insufficient 
for analysis of blood data. This interpretation is further rein- 
forced by Figure 4, which shows reference-free coefficient esti- 
mates by their corresponding reference-based estimates; there is 
general agreement between methods for coefficients with larger 
magnitude, except for a single CpG whose magnitude appears 
larger by the reference-based method than by the reference-free 
method; in general, for the relatively null CpGs, the reference- 
based method produces estimates of larger magnitude than the 
reference-free method. Supplementary Material (Section VI) pro- 
vides volcano and scatter plots similar to those shown in 
Figures 3 and 4, but showing the SVA results. In addition. 
Supplementary Figure S9 shows a comparison of RMSE be- 
tween our reference-free approach and SVA, where the 500- 
DMR reference-based analysis was used as a presumed gold 
standard. In general, the SVA results were dissimilar from 
both of the other adjusted analyses, with greater significance 
relative to both, and more similarity with the unadjusted ana- 
lysis. This suggests inadequate adjustment by SVA. SVA results 
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Fig. 3. Arthritis dataset: volcano plots. Volcano plots for B2 unadjusted 
for leukocyte composition, adjusted using the reference-based method 
that adjusts for six estimated cell type proportions and adjusted using 
the proposed reference-free method with <i= 20. Red indicates 387 leuko- 
cyte DMRs (overlap between 450K array and 500 CpGs published by 
Houseman et al., 2012) 
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Fig. 4. Arthritis dataset: reference-based versus reference-free, compari- 
son of reference-free coefficient estimates P2 with the corresponding ref- 
erence-based estimates 



with d=31 were similar to those obtained using d=53 (data not 
shown in detail, but a summary appears in Supplementary Fig. 
S9). For each of the four analysis types, Table 3 summarizes 
overall significance as measured by g-value methodology 
(Storey et ai, 2004) implemented in the R package qvalue, includ- 
ing an estimate of ttq, the proportion of nulls, as well as the 
number of CpG sites for which q<OA. Interestingly, the un- 
adjusted and reference-based analyses produced about the 
same estimated proportion of null CpGs as well as a relatively 
large number of CpGs for which q<OA. Despite apparently 
increased significance shown in the volcano plot appearing in 
the Supplementary Material, the SVA-adjusted approach pro- 
duced a higher value of ttq and fewer CpGs (though still a 



Table 3. Summary of significance for examples 



Dataset Number 
of CpGs 


Analysis 


Pr(null) 


#(^<0.1) 




Unadj 


0.20 


3 36227 




Ref-based 


0.24 


312133 


Blood (Arthritis) 3 84410 


SVA-adj 


0.41 


2 07 569 




Ref-free 


0.84 


0 




Unadj 


0.43 


3993 


Placenta (SGA) 21 551 


SVA-adj 


0.90 


0 




Ref-free 


0.92 


0 


Gastric (Tumor versus 


Unadj 


0.17 


24151 


SVA-adj 


0.13 


25 681 


non-malignant) 


Ref-free 


0.71 


1787 



substantial number) for which q<OA. The reference-free analysis 
produced a much higher estimated value of ttq and no CpGs for 
which q<OA. Although there were no significant g- values after 
reference-free adjustment, an omnibus test of significance pro- 
posed in Section V of the Supplementary Material results in an 
overall P< 0.002 even after reference-free adjustment. As men- 
tioned in Leek and Storey (2007), g- values can be misleading 
when applied to multiple correlated tests. 

A possible explanation for decreased significance of the refer- 
ence-free method relative to the other methods is that the refer- 
ence-free approach led to less precise estimates. However, the 
median ratio of reference-free to reference-based standard error 
was 0.79, so that the reference-free approach led to estimates that 
were apparently more precise, overall, than the reference-based 
adjustment. Supplementary Material (Section VI) provides plots 
comparing estimates and standard errors for the three different 
methods. On the basis of these statistics and results, it appears 
that the reference-free method more precisely estimates effects 
that are smaller than those produced by the other two methods; 
thus, inflation of standard errors is an inadequate explanation. 
Another explanation for decreased significance of the reference- 
free method is that the reference-free approach potentially ac- 
counts for many more cell types than either the reference-based 
or reference -free approaches. The reference set provided by 
Houseman et al. (2012) differentiates granulocytes from other 
cell types, but does not differentiate neutrophils, basophils and 
eosinophils within the granulocyte category; it distinguishes 
CD4+ and CD8+ T-cells from other types of cells that are not 
T cell lymphocytes, but does not differentiate T helper cells, 
regulatory T cells or memory T cells. These types may be im- 
portant differentiators of rheumatoid arthritis. 

Our second analysis consists of array data for 92 independent 
peripheral blood mononuclear cell (PBMC) samples, assayed 
using the Illumina Infinium HumanMethylation27 {27 K) tech- 
nology, originally pubhshed in Lam et al. (2012) and available 
in GEO, Accession number GSE37008. For the purposes of this 
analysis, PBMC samples can be thought of as whole blood with 
granulocytes removed. In addition to DNA methylation data, 
complete blood count differential data were available for each 
sample, thus providing gold standard estimates for the fraction 
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of the PBMC sample consisting of monocytes, assumed to be one 
minus the fraction of lymphocytes. Using several different 
approaches applied to the subset of autosomal CpGs, we exam- 
ined the association between DNA methylation and the loga- 
rithm of il6 response to phorbol-12-myristate-13-acetatein ('log 
pma'), a potent cell division promoter. The first analysis was 
unadjusted; the second analysis was adjusted for monocyte frac- 
tion; the third and fourth analyses were adjusted for blood cell 
fractions estimated using the approach of Houseman et al. 
(2012), similar to the approach described above, with the top 
100 or 500 DMRs pubHshed in Houseman et al. (2012). 
Koestler et al. (2013) provide a detailed study of the application 
of reference-based estimates of cell proportion using this dataset. 
In the fifth approach, we adjusted for <i= 1 1 surrogate variables 
using SVA, with d obtained by the method of Buja and Eyuboglu 
(1992). In the sixth approach, we applied the reference-free ap- 
proach proposed in this article with d= 10, estimated via RMT. 
Both the third and fourth approaches produced similar results, 
and results reasonably similar to the second approach. The ref- 
erence-free approach produced slightly increased significance 
over approaches 2-4. The SVA results were similar to the refer- 
ence-free approach, although the resultant RMSE values were 
larger for the SVA approach than for the reference-free ap- 
proach, when monocyte- adjustment or reference-based adjust- 
ment was used as a gold standard [Supplementary Fig. SlO(o)]. 
The unadjusted analysis produced results that were substantially 
more significant than those produced by the other four 
approaches. Details of this analysis appear in Section VII of 
the Supplementary Material. In summary, our proposed refer- 
ence-free approach produces results similar to (though slightly 
less variable than) those produced by SVA, to reference-based 
adjustments as well as adjustment by a known (though coarsely 
differentiated) gold standard, but quite distinct from the un- 
adjusted approach. 

Our third analysis consists of 27K array data for 176 placenta 
samples originally published in Banister et al. (201 1). All n=\16 
infants considered in this analysis were of gestational age greater 
than 37 weeks. Data were adjusted for Bead Chip effect using 
ComBat (Johnson et al, 2007). Maternal- age adjusted differ- 
ences in placental DNA methylation between 52 small-for-gesta- 
tional age (SGA) infants and 124 normal infants at 21 551 
autosomal non-SNP CpG loci were estimated using three meth- 
ods. In the unadjusted analysis, limma was used with design 
matrix consisting of an intercept, an indicator variable for 
SGA, and maternal age. In the SVA-adjusted analysis, we add- 
itionally adjusted for <i= 13 surrogate variables, with d obtained 
by the method of Buja and Eyuboglu (1992). In the reference- free 
analysis, we used our proposed method with d=\2 (estimated 
via RMT). The volcano plots shown in Figure 5 suggest that the 
SVA and reference-free analyses produce modestly diminished 
significance compared with the unadjusted analysis. Figure 6 
shows adjusted effect estimates by their corresponding un- 
adjusted estimates and by SVA-adjusted estimates; the figure 
suggests that a substantial fraction of CpGs demonstrates 
slightly larger effect magnitude when compared with their ad- 
justed counterparts. As shown in Figures 5 and 6, as well as 
Supplementary Material (Section VIII), SVA and the reference- 
free methods produce almost identical results. In addition. 
Table 3 suggests diminished significance of results in the 
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Fig. 5. Placenta dataset: volcano plots. Volcano plots for B? unadjusted 
for leukocyte composition and adjusted using the proposed reference-free 
method with d=\2 
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Fig. 6. Placenta dataset: comparison of reference-free adjustment with 
unadjusted and SVA- Adjusted. Comparison of reference-free coefficient 
estimates ^2 with the corresponding unadjusted and SVA-adjusted 
estimates 



unadjusted analysis. Overall, the analysis suggests that much of 
the effect of SGA on DNA methylation may be explained by cell 
mixture, but that the mixture effect is substantially smaller than 
that occurring for DNA methylation measured in blood. 

Our final analysis consisted of 27K data for 203 gastric tumors 
and 94 gastric non-mahgnant samples, originally pubhshed by 
Zouridis et al. (2012) and available in GEO, Accession number 
GSE30601. In each of the three analyses of DNA methylation at 
autosomal CpGs, we compared tumors to non-mahgnant sam- 
ples. The first analysis was unadjusted; the second analysis was 
adjusted for 27 surrogate variables (with d=H determined by 
the method of Buja and Eyuboglu, 1992); in the third, we applied 
the reference-free approach proposed in this article, with d = 24, 
obtained via RMT. In general, the unadjusted, SVA and refer- 
ence-free approaches produced different results. The SVA and 
reference-free approaches differed systematically (Wilcoxon 
P< 0.0001) at CpGs mapped to polycomb target genes 
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(Bracken et al., 2006; Lee et al., 2006; Schlesinger et al., 2007; 
Squazzo et al., 2006), with the reference-free approach demon- 
strating greater hypomethylation in tumors at PcG targets. Thus, 
compared with the other approaches, the reference-free method 
may demonstrate superiority in identifying biologically relevant 
alterations of DNA methylation in highly heterogeneous tumor 
samples. A summary of significance appears in Table 3; details of 
the analysis and graphical results appear in Supplementary 
Material (Section IX). 



5 DISCUSSION AND CONCLUSIONS 

We have proposed a reference-free method of conducting EWAS 
while adjusting for cell mixture. Following on work by 
Houseman et al. (2012), this method posits a statistical model 
that involves a latent variable representing mean methylation, 
together with a factor-analytic error model consistent with the 
surrogate variable approaches of Leek and Storey (2007) and 
Teschendorff et al. (2011). We have also proposed a companion 
bootstrap methodology for estimating standard errors. 

Our simulation studies suggest that our proposed approach 
returns reasonably unbiased estimates of the direct effect, i.e. 
the effect not mediated by cell type, and reasonable standard 
error estimates. They also suggest adequate control of Type I 
error and reasonable power to detect direct effects of sufficient 
magnitude. Our simulations also suggest that the RMT method 
of dimension estimation, proposed by Teschendorff et al. (201 1), 
performs well for estimating the dimension of the latent struc- 
ture. Finally, our method performs about as well as surrogate 
variable analysis (SVA) when technical error is of large magni- 
tude, and better than SVA when technical error is of smaller (but 
still realistic) magnitude. We have demonstrated our method on 
four separate datasets. The first was a rheumatoid arthritis 
dataset, where DNA methylation in blood shows substantial at- 
tenuation in effect after applying our reference-free approach, 
even in comparison to an approach similar to that used by Liu 
et al. (2013), wherein DNA methylation was adjusted for cell 
types profiled by Houseman et al. (2012) using methodology 
proposed in that paper. A possible explanation is that DNA 
methylation effects may be mediated by cell types not profiled 
by Houseman et al. (2012). The second set consisted of PBMC 
data and associated effects of il6 response to a potent tumor 
promoter, for which we demonstrate that our proposed refer- 
ence-free approach produces results similar to reference-based 
approaches but dissimilar from an unadjusted analysis. The 
third set was a placental dataset for which no reference data 
were available for component cell types. In this dataset, effects 
of SGA showed modest attenuation after applying our reference- 
free method. This suggests that the effects of growth restriction 
on DNA methylation are likely less mediated by changes in cell 
distribution, a result that one might anticipate in a solid tissue 
sample such as placental tissue. The fourth analysis compared 
gastric tumors with non-malignant gastric tissue, showing differ- 
ences in results between mixture-adjusted and unadjusted 
analyses, but with significant direct effects produced even in 
the adjusted analyses. In this last example, there is evidence 
that the results of the reference-free method may be more bio- 
logically meaningful than those produced by SVA. 



An interesting aspect of our methodology is its tendency 
to produce 'spikes' in volcano plots, such as those shown in 
Figure 5 or Supplementary Figure SI 2(a). These are caused by 
shrinkage of standard errors, particularly at DMRs, i.e. CpGs 
that drive the cell mixtures in A. Because these CpGs collectively 
borrow statistical strength from each other, their standard 
errors tend to be smaller. This phenomenon is evident in 
Supplementary Figures S7(b) and S 10(d), which compare refer- 
ence-free and unadjusted standard errors for the blood and 
PBMC datasets, respectively; these plots demonstrate that for 
many known leukocyte DMRs, the standard errors in the refer- 
ence-free method are noticeably smaller than the corresponding 
standard errors from the unadjusted analysis. The same phenom- 
enon is evident in Supplementary Figure S7(h), which plots the 
standard errors from the reference-adjusted analysis against the 
corresponding standard errors from the unadjusted analysis, and 
demonstrates shrinkage of standard errors at leukocyte DMRs. 

Unlike the SVA approaches of Leek and Storey (2007) and 
Teschendorff et al. (2011), we posit a specific data generation 
model that incorporates a factor-analytic structure that is impli- 
cit in the previously pubHshed methods. However, we incorpor- 
ate all CpG features in the factor-analytic structure, rather than 
attempting to select a subset of features that are optimally in- 
formative. While there may be some loss of precision in using all 
CpG features present on a given array, we view this as an ac- 
ceptable sacrifice to accommodate an agnostic approach that 
permits any CpG to serve as a DMR. In addition, the surrogate 
variables for which our model implicitly adjusts have an explicit 
mixing interpretation. Our simulations suggest equivalent or 
better results using our proposed method; the data analyses dem- 
onstrate that our method can produce results that are similar to 
SVA, as in the PBMC and placenta examples, or results that are 
distinct, as in the arthritis and gastric tumor examples. In the 
gastric tumor analysis, there is a suggestion that our reference- 
free approach may produce results slightly more consistent with 
known biology. The reference-free approach is able to do this 
without the somewhat time-consuming comparison of candidate 
surrogate variables with potential confounders; however, it 
achieves this result by essentially projecting unadjusted effect 
estimates on an error matrix, with a linearity assumption that, 
while biologically motivated, may sometimes fail. In addition, 
our method is designed to deconvolute cell mixtures; it is not 
designed to uncover confounders that are specific to technical 
sources of variation. Thus, SVA might be applied on an M- 
value scale to extract surrogate variables that are specific only 
for technical variation (i.e. by removing from the set of estimated 
surrogate variables those that are associated with biological con- 
founders) and are subsequently included in our reference-free 
approach. However, it is somewhat unclear how to proceed 
when batches are confounded with phenotypes; more detailed 
research is needed to develop methods that adjust simultaneously 
for linear mixing effects and non-linear technical effects, using 
datasets having fully annotated technical data such as chip 
number and position. However, we view this article as a concrete 
step in that direction, and our data samples demonstrate the 
practicality of our method. 

Our approach now offers the possibility of conducting EWAS 
analysis adjusting for mediation by cell type even without the 
existence of reference datasets that may be expensive or infeasible 
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to collect. If widely adopted, it could pave the way for EWA 
studies that are more robust with higher potential for replication 
of results. 
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